Abstract. - I propose to sharpen the index h, proposed by Hirsch as a useful index to characterize
the scientific output of a researcher, by excluding the self-citations. By performing a self-experiment
and discussing in detail two anonymous data sets, I show that self-citations can significantly
reduce the h index, in contrast to Hirsch's expectations. This result is confirmed by an
analysis of 13 further data sets.



Introduction. — About one year ago the physicist 
Hirsch [1] proposed an easily computable index h as
an estimate of the visibility, importance, significance,
and broad impact of a scientist's cumulative research
contribution. This index h is defined as the highest
number of papers of a scientist that received h or more
citations. Of course, for many people "it is distasteful to
reduce a lifetime's work to a number" [2]. For others h is
an "elegantly simple" measure [3], which allows an easy
comparison of the scientific achievement of a scientist in
an unbiased way by a single number. As it can be determined
easily by ordering the publication list according
to the number of citations, which is for example possible
using the Science Citation Index provided by Thomson
ISI in the Web of Science (WoS) data base, it has received
immediate attention in the public [4] and the physics
community [5-7] and is already widely recognized as a
convenient measure in evaluations. Already a significant
amount of literature in informetrics [8-13] has been dealing
with this measure of visibility of a scientist. Different
data sets have been evaluated to identify the most highly
cited scientists in various fields [1,3,7,14]. A comparative
study on committee peer review [15] of post-doctoral
researchers in biomedicine suggested that the Hirsch
index is indeed a promising (rough) measurement of the 
quality. The statistical correlation of the Hirsch index 
with standard bibliometric indicators and peer judgement 
was shown to be quite high for 147 chemistry research 
groups in the Netherlands [16]. A critical analysis of the 
Hirsch index of 187 evolutionary biologists and ecologists 
from the editorial boards of seven journals illustrates the 
risk of indiscriminate use of the index [2]. A quantitative
investigation of the statistical reliability [6] has cast 
doubts on the accuracy and precision of the Hirsch index. 
Nevertheless the interest in this measure continues to 
grow [7-12,16-21]. 
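
To make the definition of h concrete, the following minimal Python sketch computes the index from a list of citation counts. The function name `h_index` and the sample counts are assumptions chosen for illustration; they are not taken from any data set discussed here.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the paper at this rank still has enough citations
        else:
            break     # counts only decrease from here on
    return h


# Hypothetical citation counts, not data from this study.
example = [25, 8, 5, 3, 3, 1, 0]
print(h_index(example))  # prints 3
```

The sort corresponds to ordering the publication list by the number of citations, as one would do with the WoS output.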

It was shown [10] that the Hirsch index notion can
be extended to the general framework of information
production processes and that any system has a unique
Hirsch index. Banks [21] has extended it to an index
for scientific topics and compounds in order to identify
hot topics and interesting materials. The Hirsch-Banks
index is defined in analogy to h as the highest number
of papers in a particular field or on a specific compound
that received h or more citations. This extension has
also received a lot of attention even beyond the scientific
community, identifying nanotubes, nanowires and quantum
dots as the most interesting topics in recent years.
Other generalizations concern the comparison of entire
research groups by their Hirsch index [16] and the utility
for assessing the impact of journals [12,22].
When identifying hot topics, it is obvious that one will be
dealing with a set of publications which are heavily cited
within the field, which means that they are probably most
often cited by people working on the same topic, i.e. by
the same set of people who have written these publications.
However, when assessing the scientific achievement
of an individual scientist, the analogous kind of citations
within the data set, namely the self-citations, should
ideally not be included, because they do not reflect
the impact of a publication. Of course, self-citations
increase the h index, but Hirsch has argued that the effect
is relatively small and that the necessary corrections for h
would involve only very few if any papers. An analysis of
a group of scientists in ecology and evolution [2], however,
showed an average decrease of 12.3%. In contrast, the
Hirsch indices of 31 influential scientists in information 
science dropped only between zero and three, on average 
by 0.9, or 6.6%, when self-citations were excluded [3]. In 
the present investigation I demonstrate that the influence 
of self-citations on the Hirsch index can be drastic, in 
particular for younger scientists with a low Hirsch index. 
Three different ways to sharpen the Hirsch index will be 
proposed. 



Michael Schreiber 



Data base. — It is a rather time-consuming task to 
identify all self-citations. Because of self-interest and of
the fact that it is relatively easy to check one's own publications
and citations, I first performed a self-experiment and
investigated several ways to determine the self-citations by
myself and by my co-authors. Excluding them, my Hirsch
index dropped by 18%. Then I also analyzed the publications
of a somewhat older colleague who is working in a
more topical field in a mainstream area. In contrast, I also
investigated the records of a somewhat younger colleague,
working in a less attractive field, who has published fewer
papers. Their Hirsch indices also dropped significantly, by
13% and 46%, respectively.

Before analyzing the self-citations, one has to make sure
that the data base is correct. This concerns the usual difficulty
that different persons with the same name and
same initials are found. The often suggested solution of
checking the affiliation is rather complicated for researchers
who have moved between various places.
Moreover, my own university is an example of why the
correlation with the affiliation is often misleading, because we
changed our name not only between faculty, department
and institute, but also between Hochschule, Technical University,
and University of Technology, and further from
Karl-Marx-Stadt via Chemnitz-Zwickau to Chemnitz. Another
problem in establishing the data base is the possibly
different spelling of names, which is particularly evident
for the transliteration of e.g. Russian authors, or for
names which have changed e.g. by marriage. In principle,
the same difficulties occur for the identification of the self-citations.
However, it is quite unlikely that a manuscript
is cited by a different scientist with the same name, so
that this problem does not occur in practice. On the other
hand, different ways of spelling an author's name or entirely
different names of the same author can easily mask
self-citations, so that care should be taken in these cases.
Of course, missing citations because of misspelled names
cannot be avoided, because they do not show up at all in
the search. The data sets used below have been carefully
checked with respect to the mentioned difficulties. In my
own case the WoS search yielded 754 results, out of which
only 268 were my own publications. The full list would
give me a flattering, but wrong $h = 46$ instead of $h^A = 27$
(the superscript is used to distinguish the different data
sets). The names of both colleagues whose publications
are analyzed in detail below are not so common, so that in
their cases the analysis was relatively easy, because nearly
all papers which were found in the ISI data base for their
names were really published by these colleagues. For the
set B with 282 papers I analyzed only the 131 publications
with 10 or more citations and found just two which did
not appear in this author's publication lists. For the set
C I confirmed that 87 of the listed 91 papers should be
attributed to the colleague. In both cases there was no
influence on the Hirsch index.

Fig. 1: Number of citations for my 54 most cited papers (dark
grey/brown), own self-citations (hatched), and maximal number
of citations by one of the co-authors including myself (light
grey/yellow).



Self-citations. — Before coming to the analysis of 
these data sets, let me comment on the question of why
self-citations may appear. One reason is that they are
really needed in the manuscript in order to avoid repetition
of previously described experimental setups, theoretical
models, as well as results and conclusions which may
be necessary for the discussion in a certain manuscript
but need not be repeated in this manuscript. Such self-citations
are of course completely legitimate. A second
reason for self-citations is that probably everybody knows
their own previous manuscripts best and therefore it is easier
to refer to these papers when a citation is required
in a given context for a certain argument. This
practice is already questionable, at least when the number
of such self-citations is relatively high. The third reason
for self-citations is certainly disreputable: due to the
ever-increasing number of evaluations which are based on
citation counts, it is of course tempting to enhance one's
citation count by referring to one's own papers for this very
purpose. The Hirsch index is vulnerable to such practice,
because it is a single number which can be relatively easily
enhanced by specifically citing those papers for which
the citation count is close to but below the critical value
h. For example, in my own case (see fig. 1) just one citation
of my 28th paper would be sufficient to increase
the Hirsch index. However, this paper happens to be the first
manuscript that I have ever co-authored, so that its "limited
period of popularity" [23] has long ended. It is also
not a "sleeping beauty" [23] and therefore it is unlikely
to be cited by somebody else. Therefore I would have to
cite it myself if I wanted this paper to have an effect on
my Hirsch index. In the future, when the Hirsch index has
become - as I expect - more popular, such manipulations
might become more severe. In any case, even the perfectly
legitimate self-citations mentioned above should not be included
in any measure of scientific achievement; therefore
the self-citations should be excluded.
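
How cheaply the index can be raised may be illustrated by a short Python sketch. It counts the smallest number of additional citations needed to lift h to h + 1. The function names and the citation counts are hypothetical; the toy list merely mimics the situation described above, in which the 28th paper is one citation short.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def citations_needed_for_next_h(citations):
    """Fewest extra citations that raise the index from h to h + 1.

    Returns None if the record holds fewer than h + 1 papers.
    """
    ranked = sorted(citations, reverse=True)
    target = h_index(ranked) + 1
    if len(ranked) < target:
        return None
    # The cheapest route is to top up the `target` most cited papers.
    return sum(max(0, target - count) for count in ranked[:target])


# Hypothetical record: 27 papers with 30 citations each, the 28th with 27.
toy = [30] * 27 + [27, 20]
print(h_index(toy), citations_needed_for_next_h(toy))  # prints 27 1
```

A single well-placed self-citation is enough in this toy record, which is why counts close to but below h deserve particular scrutiny.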

Fig. 2: Same as fig. 1 for the 61 most cited papers in data set B.

Fig. 3: Same as fig. 1 for the 26 most cited papers in data set C.



SCCs of the first kind: $h_a$. — The problem is now
how to identify self-citations. In the WoS search one can
obtain the names of up to 100 citing authors for a given
paper and how often these people cited the respective paper.
Thus it is easy to identify how often somebody has
cited his own paper. I call these the self-citation corrections
(SCCs) of the first kind. The respective data are
shown in figs. 1, 2, and 3 for the example data sets mentioned
above. In my own case, see fig. 1, eight papers
dropped below the critical value of $h^A = 27$, five of them
even below the value $h_a^A = 24$ (the subscript is used to
label the different SCCs). Fortunately two manuscripts
with the full citation count between 24 and 27 remained
in that range even after the SCCs had been taken into account.
Consequently, my Hirsch index was reduced only
to $h_a^A = h^A - 5 + 2 = 24$, not to $h^A - 8 = 19$. Of course,
due to the strongly fluctuating number of self-citations,
the publications have to be reordered by the number of citations
after the SCCs have been taken into account. The
respective result is shown in fig. 4, confirming $h_a^A = 24$.
For the data set B in fig. 2, the SCCs are often drastic, like
53 self-citations for the fifth paper, but usually still leave
a significant number of other citations. Consequently the
SCCs do not influence the Hirsch index very strongly; they
lead to a reduction from $h^B = 38$ to $h_a^B = 34$, as shown
in fig. 5. In the case C, however, the SCCs in fig. 3 are so
significant that the citation counts of all manuscripts fall
below the value $h^C = 13$. However, 7 of these manuscripts
have a corrected count of 7 or more citations, leading to
the new $h_a^C = 7$. Out of the 12 manuscripts which originally
had between 7 and 12 citations, two remain in this
range but cannot enhance the value, as shown in fig. 6.
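
The correction of the first kind, including the reordering step, can be sketched in a few lines of Python. The pairs of total citations and own self-citations below are invented for illustration and do not correspond to any of the data sets A, B or C; the function names are likewise assumptions.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def h_first_kind(papers):
    """Index after subtracting the investigated author's own self-citations.

    `papers` is a list of (total_citations, own_self_citations) pairs.
    """
    corrected = [total - own for total, own in papers]
    # h_index sorts again, which is the reordering by corrected counts.
    return h_index(corrected)


# Hypothetical (total, own self-citations) pairs.
papers = [(12, 2), (9, 5), (7, 1), (6, 4), (5, 0), (4, 3)]
print(h_index([total for total, _ in papers]), h_first_kind(papers))  # prints 5 4
```

In this toy record the paper ranked second falls to the fifth place after the correction, which shows why the list has to be reordered before the index is read off.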

SCCs of the second kind: $h_c$. — Of course, if a
paper is cited by one of the co-authors, such a citation
should also not be taken into account. Using again the
above-mentioned ISI list of citing authors for a particular
publication, it is relatively easy to find the co-author
with the highest number of citations for this particular
publication. I call the reduction of the citation count by
this number the SCCs of the second kind. For long author
lists, at first sight the analysis appears to be not
so straightforward, because the WoS summaries show at
most 3 authors. However, the "Format for Print Page"
displays all co-authors. In my own case, the number of citations
for several manuscripts dropped significantly more
by the SCCs of the second kind than by those of the first
kind, as can also be deduced from fig. 1, in particular
for order numbers 4, 11, 21, 27 - 29, 36 - 38, 50. Again
a reordering of the manuscripts had to be performed; the
result is included in fig. 4. The corresponding index $h_c^A$,
which is corrected for the (co-)author with the most self-citations,
can be determined from fig. 4 as $h_c^A = 23$. That
means that the SCCs of the second kind did further reduce
my Hirsch index, but only slightly, although the citation
counts of several papers dropped. For the two colleagues,
the respective data are also included in figs. 2, 5
and 3, 6, respectively. In case B, sometimes a co-author
was an even more enthusiastic self-citer, see e.g. the
sixth paper in fig. 2, with 60 self-citations. Nevertheless,
as this occurred again mostly for papers with a large citation
count, the effect on the Hirsch index is small: it
is reduced to $h_c^B = 33$. In the case C a co-author
was rarely more enthusiastic in citing the own manuscripts than
the investigated author himself; therefore in this case the
Hirsch index remains at the value $h_c^C = h_a^C = 7$.
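
As a sketch of the correction of the second kind, the following Python fragment subtracts, for each paper, the largest self-citation count found among all of its authors. The data layout (a dictionary of citation counts per author), the author labels and all numbers are assumptions made for this illustration.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def h_second_kind(papers):
    """Index after subtracting the most active (co-)author's self-citations.

    `papers` is a list of (total_citations, cites_by_author) pairs, where
    cites_by_author maps every author of the paper, including the
    investigated one, to the number of times that author cited it.
    """
    corrected = []
    for total, cites_by_author in papers:
        largest = max(cites_by_author.values(), default=0)
        corrected.append(total - largest)
    return h_index(corrected)  # sorting inside h_index does the reordering


# Hypothetical record; "me" is the investigated author.
papers = [
    (12, {"me": 2, "coauthor_x": 6}),
    (9, {"me": 5}),
    (7, {"me": 1, "coauthor_y": 0}),
    (6, {"me": 4, "coauthor_x": 1}),
    (5, {"me": 0}),
    (4, {"me": 3, "coauthor_y": 3}),
]
print(h_index([total for total, _ in papers]), h_second_kind(papers))  # prints 5 4
```

Here the most cited paper loses six citations instead of two, yet the index is the same as with the own self-citations alone, in line with the observation that large corrections on highly cited papers hardly affect h.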

Fig. 4: Number of citations as in fig. 1 (dark grey/brown),
without my own self-citations, reordered (white), without maximal
number of any co-author self-citations, reordered (black),
and without cumulative co-author self-citations, not reordered
(medium grey/blue). Note that the latter histograms conceal
the previous ones, so that in particular the columns of 2nd and
3rd kind often do not show up, because they are not different
from the 3rd and/or 4th kind. The reordering is not restricted
to the 54 papers in fig. 1 but comprises the full data set.

Analyzing the author list for the citations of a particular
publication, it is straightforward to identify all co-authors
as long as they appear among the set of 100 citing authors
to which ISI displays are limited. Of course, the effort is
significantly higher than for the SCCs of the second kind,
because now one has to look for all co-author names in
often long lists of citing authors; usually one has to check
the complete lists, because some co-authors, typically e.g.
PhD students, never appear. Therefore I have performed
this analysis completely only for my own publications and
for the relatively small data set in fig. 3. Summing the self-citations
of all co-authors of course overshoots the aim,
because the counted self-citations are not just additive, as
two authors of a paper may have written another paper
together, citing the first one, which would be counted as
a self-citation for both co-authors. This overestimate can
be so severe that it can lead to negative values for the
citation count of papers which are heavily cited by several
co-authors. Nevertheless I have analyzed the data in figs. 1
and 3 after subtracting the sum of all self-citations for
each paper, resulting in a lower limit for the corrected
Hirsch index of 20 for data set A and 5 for data set C. For
the data set B, the same analysis was performed only for
the publications with 30 or more citations and yielded
a lower limit of 29. (Note that this result confirms that it is sufficient
to analyze the publications with more than 29 citations.)
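
Why summing the per-author self-citation counts overshoots can be seen in a small Python sketch. Each citing paper is modelled as a set of author names; this representation, the function names and the toy data are assumptions for illustration and do not reflect the format in which the WoS delivers its data.

```python
def summed_self_citations(authors, citing_papers):
    """Add up, author by author, how many citing papers each one co-wrote."""
    return sum(
        sum(1 for citing in citing_papers if author in citing)
        for author in authors
    )


def distinct_self_citations(authors, citing_papers):
    """Count a citing paper once if any author of the cited paper co-wrote it."""
    return sum(1 for citing in citing_papers if authors & citing)


# Hypothetical paper by authors a, b and c that received three citations.
authors = {"a", "b", "c"}
citing_papers = [{"a", "b", "c"}, {"a", "b"}, {"d"}]

total = len(citing_papers)
lower_limit = total - summed_self_citations(authors, citing_papers)
corrected = total - distinct_self_citations(authors, citing_papers)
print(lower_limit, corrected)  # prints -2 1
```

The two jointly written citing papers are counted five times in the sum, so the summed correction turns three citations into a negative count, whereas counting each citing paper once leaves one independent citation.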

Fig. 5: Same as fig. 4, for data set B.

SCCs of the third kind: $h_s$. — The correct way
of taking multiple co-author self-citations into account is
obviously to check every citing paper for co-authorship.
This yields the SCCs of the third kind. That requires an
enormous amount of tedious work, which can be done relatively
easily for one's own publications, although it is still
quite time-consuming and error-prone. Fortunately one
can do this in the ISI citing author list by checking (ticking
off) all co-author names and viewing the data, which gives
a list and thus the number of cumulative self-citations of
all co-authors. The results are included in figs. 4-6, and
the analysis yields a reduction of the Hirsch index to the
sharpened Hirsch index $h_s^A = 22$, $h_s^B = 33$, and $h_s^C = 7$,
respectively. It can be seen that the effect on the number
of citations for many publications is zero compared to the
SCCs of the second kind. Accordingly the reduction of
the Hirsch index from $h_c$ to $h_s$ is small or zero. Therefore
it is a rather safe assumption that it is usually sufficient
to perform this analysis only for those papers which are
ranked in the vicinity of the index $h_c$ defined above, taking
the SCCs of the second kind into account.
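
Checking every citing paper for co-authorship amounts to an intersection test between two author lists. The Python sketch below assumes that each paper and each citing paper is available as a set of normalized author names, which is a simplification: it ignores the spelling and homonym problems discussed earlier. All names and data are hypothetical.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def h_third_kind(papers):
    """Sharpened index: drop every citing paper that shares an author.

    `papers` is a list of (authors, citing_papers) pairs; authors is a set
    of names and citing_papers is a list of such sets.
    """
    corrected = []
    for authors, citing_papers in papers:
        independent = [c for c in citing_papers if not (authors & c)]
        corrected.append(len(independent))
    return h_index(corrected)


# Hypothetical record of the investigated author "me".
papers = [
    ({"me", "x"}, [{"me"}, {"x"}, {"me", "x"}, {"p"}, {"q"}, {"r"}]),
    ({"me"}, [{"p"}, {"q"}, {"me", "p"}]),
    ({"me", "y"}, [{"y"}, {"y", "z"}, {"p"}]),
]
print(h_index([len(c) for _, c in papers]), h_third_kind(papers))  # prints 3 2
```

Because each citing paper is inspected once, a citation written jointly by two co-authors is removed only once, which avoids the overshoot of a summed correction.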
For the data set B of fig. 2, which is based on a large
number of publications with a large number of citations
and, most important for the amount of correlations, usually
many co-authors, an analysis of the SCCs of the third
kind for only the 22 publications with a citation count
between 28 and 56, i.e., between 85% and 170% of $h_c^B$, appeared
to be appropriate a priori. In retrospect, it would
have been more than sufficient to determine the cumulative
self-citations for the 10 papers with a citation count
between $h_c^B$ and about $1.2\,h_c^B$, finding that although 3 out
of these publications dropped below the value of $h_c^B = 33$,
the remaining were just sufficient to keep $h_s^B$ at $h_c^B$. In
fact, in this particular case, even an analysis of the 4 papers
with a citation count of exactly $h_c^B$ would have been
enough. On the other hand, starting from the full citation
counts (i.e. not taking first the SCCs of the second
kind into account) one would have had to analyze at least
the citations of 26 papers falling originally into the range
between $0.85\,h^B$ and $1.6\,h^B$, in order to reach the correct
$h_s^B = 33$.

For my own case an analysis of the 15 publications between
$0.85\,h_c^A$ and $1.7\,h_c^A$ yields the correct $h_s^A = 22$,
but a restriction to the 6 papers in the range between $h_c^A$
and $1.2\,h_c^A$ already misses one (the 17th in fig. 4) out of
the 3 whose citation counts drop below $h_c^A$. Starting from
the full citation counts, the range of $0.85\,h^A$ to $1.7\,h^A$
comprising 21 papers would have been just sufficient to
determine $h_s^A$ correctly.
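
The restriction to papers ranked near the index can be written as a simple filter. In the Python sketch below the window factors 0.85 and 1.7 are the ones used in the text, while the function names and the citation counts are hypothetical. As the cases discussed here show, a window that is too narrow can miss relevant papers, so the selection is a heuristic and not a guarantee.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def papers_to_check(corrected_counts, low=0.85, high=1.7):
    """Positions of the papers whose count lies within [low * h, high * h].

    `corrected_counts` are citation counts after the SCCs of the second
    kind; h is the index computed from these counts.
    """
    h_c = h_index(corrected_counts)
    return [
        position
        for position, count in enumerate(corrected_counts)
        if low * h_c <= count <= high * h_c
    ]


# Hypothetical counts after the second-kind correction (index 9).
counts = [40, 30, 22, 18, 12, 11, 10, 10, 9, 9, 8, 3, 1]
print(papers_to_check(counts))                     # prints [4, 5, 6, 7, 8, 9, 10]
print(papers_to_check(counts, low=1.0, high=1.2))  # prints [6, 7, 8, 9]
```

Only the selected papers would then need the laborious check of every citing paper for co-authorship.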

Fig. 6: Same as fig. 4, for data set C.

For case C, the range $0.85\,h_c^C$ to $1.7\,h_c^C$ covers 13 publications
including the ones most cited (after excluding the
SCCs of the second kind). Therefore it is not surprising
that this range is more than sufficient to determine $h_s^C$.
In fact, also in this case an analysis of the 4 papers at $h_c^C$
would have been enough to corroborate the value $h_s^C = 7$,
although 2 of these drop below $h_c^C$. On the other hand,
starting from the original citation counts (i.e. without
considering SCCs), even the range of $0.85\,h^C$ to $1.7\,h^C$
would have been insufficient for a correct analysis; one
would have to start as low as $0.6\,h^C$ and include also the
most cited papers to obtain the correct value of $h_s^C$.
In table 1 the discussed values are compiled. The relative
reduction from h to the sharpened Hirsch index $h_s$ is
considerable. Interestingly, the absolute decrease is nearly
the same in all the above analyzed cases, namely 5 or 6,
although very different publication and citation patterns
distinguish the cases. It is therefore only a conjecture
when I infer that such an absolute reduction might be
typical.
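
The absolute and relative reductions for the three data sets can be recomputed from the index values quoted in the text. The short Python sketch below uses only those values (27, 38 and 13 for the original index, 22, 33 and 7 for the sharpened index); the function name `reduction` is an assumption for illustration.

```python
def reduction(h, h_s):
    """Return the absolute decrease and the relative decrease in percent."""
    absolute = h - h_s
    return absolute, 100 * absolute / h


# (original index, sharpened index) for data sets A, B and C as quoted in the text.
data_sets = {"A": (27, 22), "B": (38, 33), "C": (13, 7)}

for name, (h, h_s) in data_sets.items():
    absolute, relative = reduction(h, h_s)
    print(f"{name}: decrease {absolute}, i.e. {relative:.1f}%")
```

The absolute decreases are 5, 5 and 6, while the relative decreases of roughly 18.5%, 13.2% and 46.2% differ strongly because the original indices differ.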

In order to test this conjecture I have analyzed also a
fourth data set D reflecting the achievements of a prominent
scientist in La Jolla, again finding an absolute decrease
of 5 which, however, amounts to a reduction of
only 10%, because Hirsch's index is rather high. I have
then investigated another 12 data sets of physicists whom
I know rather well, so that it has been possible with a reasonable
amount of effort to make sure that the data base
is correct, in particular excluding publications of different
persons with the same name and same initials, but
on the other hand including publications with deviating
spellings of the name (mainly due to missing second initials
or an umlaut in the name). The obtained results are
also included in table 1 as data sets E-P, sorted by the
(original) Hirsch index. It turned out that the absolute
decrease from the original Hirsch index to the sharpened
Hirsch index was 3 or 4 in most cases, thus being somewhat
smaller than in the first 4 data sets. However, this is
still significant, especially noting that the relative reduction
is between 20 and 25% in most cases. Of course, due
to the small values, a difference of 1 or 2 in the results
should not be overvalued. However, as one example I note
that the sharpened Hirsch index makes a distinction between
data sets C and H much clearer than the original
Hirsch index. On the other hand, comparing data sets C
and O, one finds from the original index the reasonable
assumption that C is better than O. But the sharpened
index suggests the opposite order. Data sets J and K are
distinguishable by the sharpened index, but not by the
original Hirsch index.

Table 1: Hirsch index without and with SCCs (data in sets
A-D compiled August 2006, in sets E-P January 2007). The
total number of publications, the highest citation count, and
the relative reduction of the Hirsch index to the sharpened
index are also given for each data set.

Conclusion and outlook. — In conclusion, Hirsch's
conjecture that usually only very few if any papers need 
to be dropped from the h count, if self-citations were 
taken into account, has been shown to be unrealistic. It 
may well be true for the prominent physicists that he has 
mentioned in his paper [1]. Roediger [24] has also argued 
about the self-citations that for "people with very high 
counts, they aren't much of a problem." However, for the 
average scientist, this is not valid. I even believe it to be a 
good guess that for younger scientists with comparatively 
low Hirsch index, the influence of the SCCs is often 
relatively strong. Most of the data sets in table 1 are from 
younger scientists. But it is at this stage of the career 
where the Hirsch index is or will be probably most often 
used for the assessment of the scientific achievements of 
a scientist, be it for a promotion or for the comparison 
with competitors for an open position. One might argue 
from table 1 that the Hirsch index "only" renormalizes 
by about 20% due to SCCs and therefore remains
a useful measure even with SCCs. For more prominent
people this may be true, but for younger scientists the 
discussed deviations from the average reduction are 
important. Consequently, the Hirsch index should be 
used with reasonable care, and it would be good policy 
to take the SCCs into account. As mentioned above, it is 
straightforward and easy to determine the SCCs of the
first kind and it is also relatively easy to calculate the 
corrections of the second kind. Taking the third kind, 
i.e., the cumulative self-citations into account is of course 
the method of choice, but it is rather difficult to execute, 
unless an automatic correlation between author lists and 
citing author lists can be performed. 
Other corrections may also be reasonable in particular 
when comparing people working in different areas. It 
has already been observed by Hirsch [1] that citation 
patterns in different fields vary significantly. This was 
quantified [25] in terms of a scaling factor. Another 
correction with the (average) number of co-authors has 
been proposed [1,7,24,26] and a large impact especially 
in physics was found [7,26]. As the Hirsch index usually 
increases with the number of publications, it has been
suggested to compare it with the average h for scientists
in the same field and with the same number of publications
in order to detect those researchers who "clearly deviate
from world standards" [25]. One should also be aware
that the general search in the WoS data base does not 
take into account books, book chapters, or conference 
proceedings. For some fields these are less relevant, while 
in other fields they might be decisive for the impact of a 
scientist's research. Of course, it would also be interesting 
to investigate, how an individual's Hirsch index increases 
with time [1, 17]. 

Based on a large data set of publications [27] the distribution
of citations has been studied and a growing random
network model was used to describe the citation statistics
[28,29]. Citation patterns in a more homogeneous
community in high energy physics have also been analyzed
and modelled in detail [30,31]. As already mentioned,
when one wants to identify hot fields of research the
citations within a certain community are of interest and
should be measured, so that the self-citations might even
have some value and need not necessarily be excluded. On
the other hand, it is well known that there exist schools,
sometimes also called citation cartels, whose members
try to increase their visibility by citing mostly friends
and family. It would be an interesting exercise to exclude
citations within such a school from the determination
of the Hirsch index. This can in principle be done by
compiling a list of all co-authors with whom a certain
scientist has published any paper and excluding from
the citation list of every manuscript every citation by
anybody from this list. When this cumulative co-author
list grows with time because new co-authors appear
on the list, the self-citation-corrected count of older
manuscripts can decrease and thus the index can
also decrease with time, which is not possible with the
SCCs discussed above, nor is it possible for the original
proposal of Hirsch.
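
The exclusion of citations from a whole school, and the resulting possibility that the index decreases with time, can be illustrated with a minimal Python sketch. It assumes that every paper and every citing paper is given as a set of normalized author names; the function names and the toy record are hypothetical.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def school_corrected_counts(papers, scientist):
    """Citation counts after excluding citations by any past co-author.

    `papers` is a list of (authors, citing_papers) pairs built from sets
    of author names; the cumulative co-author list is collected first.
    """
    school = set()
    for authors, _ in papers:
        if scientist in authors:
            school |= authors
    return [
        sum(1 for citing in citing_papers if not (school & citing))
        for _, citing_papers in papers
    ]


# Hypothetical record: two older single-author papers, each cited by "p".
older = [({"me"}, [{"p"}, {"r"}]), ({"me"}, [{"p"}, {"s"}])]
before = h_index(school_corrected_counts(older, "me"))

# A new, not yet cited paper written together with "p" enlarges the list.
later = older + [({"me", "p"}, [])]
after = h_index(school_corrected_counts(later, "me"))
print(before, after)  # prints 2 1
```

Once "p" becomes a co-author, the earlier citations by "p" are removed retroactively, so the corrected index falls from 2 to 1 in this toy record.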

In any case, I believe that at least one's own citations, i.e.
the self-citations of the first kind, which can be most
easily determined, should be excluded from any evaluation,
because they can be most easily manipulated by the
author. The temptation to increase one's Hirsch index
oneself should be avoided, even though some journals
explicitly suggest to their authors to cite themselves
or other papers of the journal in order to increase the
impact factor. This is of course understandable from
their business point of view, but it is questionable from
the scientific point of view.